Image captioning is a comprehensive task in computer vision (CV) and natural language\nprocessing (NLP). It can complete conversion from image to text, that is, the algorithm automatically\ngenerates corresponding descriptive text according to the input image. In this paper, we present an\nend-to-end model that takes deep convolutional neural network (CNN) as the encoder and recurrent\nneural network (RNN) as the decoder. In order to get better image captioning extraction, we propose\na highly modularized multi-branch CNN, which could increase accuracy while maintaining the\nnumber of hyper-parameters unchanged. This strategy provides a simply designed network consists\nof parallel sub-modules of the same structure. While traditional CNN goes deeper and wider to\nincrease accuracy, our proposed method is more effective with a simple design, which is easier to\noptimize for practical application. Experiments are conducted on Flickr8k, Flickr30k and MSCOCO\nentities. Results demonstrate that our method achieves state of the art performances in terms of\ncaption quality.
Loading....